Efficient Discovery of Functional and Approximate Dependencies Using Partitions
Abstract
Discovery of functional dependencies from relations has been identified as an important database analysis technique. In this paper, we present a new approach for finding functional dependencies from large databases, based on partitioning the set of rows with respect to their attribute values. The use of partitions makes the discovery of approximate functional dependencies easy and efficient, and the erroneous or exceptional rows can be identified easily. Experiments show that the new algorithm is efficient in practice. For benchmark databases the running times are improved by several orders of magnitude over previously published results. The algorithm is also applicable to much larger datasets than the previous methods.

1 Functional and approximate dependencies

Functional dependencies are relationships between attributes of a relation: a functional dependency states that the value of an attribute is uniquely determined by the values of some other attributes. The discovery of functional dependencies from relations has received considerable interest (e.g., [2, 10, 17, 19, 11, 1, 6, 3]). Automated database analysis is, of course, interesting for knowledge discovery and data mining (KDD) purposes, and functional dependencies have applications in the areas of database management, reverse engineering [14, 20], and query optimization [21].

Formally, a functional dependency over a relation schema $R$ is an expression $X \to A$, where $X \subseteq R$ and $A \in R$. The dependency holds or is valid in a given relation $r$ over $R$ if for all pairs of rows $t, u \in r$ we have: if $t[B] = u[B]$ for all $B \in X$, then $t[A] = u[A]$ (we also say that $t$ and $u$ agree on $X$ and $A$). A functional dependency $X \to A$ is minimal (in $r$) if $A$ is not functionally dependent on any proper subset of $X$, i.e., if $Y \to A$ does not hold in $r$ for any $Y \subset X$. The dependency $X \to A$ is trivial if $A \in X$. The central task we consider is the following: given a relation $r$, find all minimal non-trivial dependencies that hold in $r$.

(Author footnote: Also at Rolf Nevanlinna Institute, University of Helsinki.)

An approximate dependency [5] is a functional dependency that almost holds. Such dependencies arise in many databases when there is a natural dependency between attributes, but some rows contain errors or represent exceptions to the rule. The discovery of unexpected but meaningful approximate dependencies seems to be an interesting and realistic goal in many data mining applications.

There are many possible ways of defining the approximateness of a dependency $X \to A$. The definition we use is based on the minimum number of rows that need to be removed from the relation $r$ for $X \to A$ to hold in $r$: the error $g_3(X \to A) = 1 - \max\{|s| : s \subseteq r \text{ and } X \to A \text{ holds in } s\} / |r|$ [5]. The measure $g_3$ has a natural interpretation as the fraction of rows with exceptions or errors affecting the dependency. Given an error threshold $\varepsilon$, $0 \le \varepsilon \le 1$, we say that $X \to A$ is an approximate dependency if and only if $g_3(X \to A)$ is at most $\varepsilon$. In this paper, we also consider the approximate dependency inference task: given a relation $r$ and a threshold $\varepsilon$, find all minimal non-trivial approximate dependencies.

We describe a new approach to the discovery of both functional and approximate dependencies. The major innovation is a novel way of determining whether a dependency holds or not. The idea is to maintain information about which rows agree on a set of attributes. Formally, the approach can be described using equivalence classes and partitions. A major advantage of the use of partitions is that it allows efficient discovery of approximate dependencies.

The algorithm is based on the levelwise algorithm that has been used in many data mining applications [12]. It starts from dependencies with a small left-hand side, i.e., from the ones that are not very likely to hold. The algorithm then works towards larger and larger dependencies, until the minimal dependencies that hold are found.
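To make the error measure g3 and the small-to-large search order described above concrete, here is a minimal Python sketch. It is a naive brute-force illustration, not the partition-based algorithm of the paper: the function names `g3_error` and `minimal_dependencies`, the dictionary row representation, and the toy relation `rows` are all assumptions introduced for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def g3_error(rows, lhs, rhs):
    """Fraction of rows that must be removed so that lhs -> rhs holds."""
    # Group the rows by their lhs values and count the rhs values per group.
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[a] for a in lhs)][row[rhs]] += 1
    # Within a group, only rows sharing one rhs value can be kept,
    # so the largest valid subrelation keeps the most frequent value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / len(rows)


def minimal_dependencies(rows, attributes, eps=0.0):
    """Naive levelwise search: try small left-hand sides first."""
    found = []  # pairs (lhs tuple, rhs attribute)
    for size in range(len(attributes)):
        for rhs in attributes:
            others = [a for a in attributes if a != rhs]
            for lhs in combinations(others, size):
                # Minimality: skip lhs if a proper subset already determines rhs.
                if any(r == rhs and set(l) < set(lhs) for l, r in found):
                    continue
                if g3_error(rows, lhs, rhs) <= eps:
                    found.append((lhs, rhs))
    return found


# Hypothetical toy relation, not taken from the paper.
rows = [
    {"A": 1, "B": "x", "C": "p"},
    {"A": 1, "B": "x", "C": "p"},
    {"A": 2, "B": "y", "C": "p"},
    {"A": 2, "B": "y", "C": "q"},
    {"A": 3, "B": "x", "C": "q"},
]

print(minimal_dependencies(rows, ["A", "B", "C"]))
print(minimal_dependencies(rows, ["A", "B", "C"], eps=0.25))
```

On this toy relation the exact search (eps = 0) reports A → B and {B, C} → A, while a threshold of 0.25 additionally admits B → A and A → C, each of which is violated by a single row out of five. The sketch re-scans all rows for every candidate, which is exactly the cost that a smarter validity test is meant to avoid.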
The worst-case time complexity of the algorithm with respect to the number of attributes is exponential, but this is inevitable since the number of minimal dependencies can be exponential in the number of attributes [10, 9]. However, if the number of rows increases but the set of dependencies stays the same, the time increases only linearly in the number of rows. To our knowledge, only one previous algorithm can claim this [18]. Other algorithms based on sorting could perhaps be implemented in linear time, e.g., by using hashing, but we are not aware of such implementations. The linearity makes the algorithm especially suitable for relations with a large number of rows.

Experimental results show that the algorithm is effective in practice, and that it makes the discovery of functional and approximate dependencies feasible for relations with even hundreds of thousands of rows. Dependency discovery tasks that have been reported to take minutes or even hours are solved with the new algorithm in seconds or fractions of a second on a PC.

Related work

Several algorithms for the discovery of functional dependencies have been presented [7, 2, 9, 18, 17, 11, 1]. We review these algorithms and compare them with our method in Section 6. The complexity of discovering functional dependencies has been studied in [8, 10, 9]. Approximate functional dependencies have been considered in [5, 15, 6, 3]. Kivinen and Mannila [5] define several measures for the error of a dependency, and derive bounds for discovering dependencies with errors. The measure $g_3$ is one of their measures. The use of partitions to describe and define functional and approximate dependencies has been suggested in [3] parallel to our work. There the emphasis is on a conceptual viewpoint, and no algorithms are given.

Extended version

An extended version of this article, with proofs and additional details, is available as [4].
An implementation of the algorithm can be obtained via the WWW page at http://www.cs.helsinki.fi/research/fdk/datamining/tane/.

2 Partitions and dependencies

Informally, a dependency $X \to A$ holds if all rows that agree on $X$ also agree on $A$. Our approach to the discovery of dependencies is based on considering sets of rows that agree on some set of attributes. We describe this idea more formally by applying equivalence classes and partitions on relations.

Partitions

Two rows $t$ and $u$ are equivalent with respect to a given set $X$ of attributes if $t[A] = u[A]$ for all $A$ in $X$. Any attribute set $X$ partitions the rows of the relation into equivalence classes. We denote the equivalence class of a row $t \in r$ with respect to a given set $X \subseteq R$ by $[t]_X$, i.e., $[t]_X = \{u \in r \mid t[A] = u[A] \text{ for all } A \in X\}$. The set $\pi_X = \{[t]_X \mid t \in r\}$ of equivalence classes is a partition of $r$ under $X$. That is, $\pi_X$ is a collection of disjoint sets (equivalence classes) of rows, such that each set has a unique value for the attribute set $X$, and the union of the sets equals the relation $r$. The rank $|\pi|$ of a partition $\pi$ is the number of equivalence classes in $\pi$.

Row Id  A  B  C  D
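The definitions of equivalence classes, partitions, and rank given above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `partition`, `rank`, and `holds`, the use of row indices as row identifiers, and the toy relation `rows` are hypothetical and are not the example table or the implementation from the paper.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Return the equivalence classes of row indices under the attribute set attrs."""
    classes = defaultdict(set)
    for idx, row in enumerate(rows):
        # Rows with identical values on attrs fall into the same class.
        classes[tuple(row[a] for a in attrs)].add(idx)
    return list(classes.values())


def rank(pi):
    """Number of equivalence classes in a partition."""
    return len(pi)


def holds(rows, lhs, rhs):
    """lhs -> rhs holds if all rows that agree on lhs also agree on rhs."""
    return all(
        len({rows[i][rhs] for i in cls}) == 1
        for cls in partition(rows, lhs)
    )


# Hypothetical toy relation, not taken from the paper.
rows = [
    {"A": 1, "B": "x", "C": "p"},
    {"A": 1, "B": "x", "C": "p"},
    {"A": 2, "B": "y", "C": "p"},
    {"A": 2, "B": "y", "C": "q"},
    {"A": 3, "B": "x", "C": "q"},
]

print(partition(rows, ["A"]), rank(partition(rows, ["A"])))
print(holds(rows, ["A"], "B"), holds(rows, ["B"], "A"))
```

Here the attribute A splits the five rows into three classes, and A → B holds because every class of the partition under A carries a single B value, whereas B → A fails because the class of rows with B = "x" contains two different A values.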
Similar resources
Discovery of functional and approximate functional dependencies in relational databases
This study develops the foundation for a simple, yet efficient method for uncovering functional and approximate functional dependencies in relational databases. The technique is based upon the mathematical theory of partitions defined over a relation’s row identifiers. Using a levelwise algorithm the minimal non-trivial functional dependencies can be found using computations conducted on intege...
Efficient Discovery of Functional and Approximate Dependencies Using Partitions (Extended Version)
TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies
The discovery of functional dependencies from relations is an important database analysis technique. We present TANE, an efficient algorithm for finding functional dependencies from large databases. TANE is based on partitioning the set of rows with respect to their attribute values, which makes testing the validity of functional dependencies fast even for a large number of tuples. The use of p...
Approximation Measures for Conditional Functional Dependencies Using Stripped Conditional Partitions
Conditional functional dependencies (CFDs) have been used to improve the quality of data, including detecting and repairing data inconsistencies. Approximation measures have significant importance for data dependencies in data mining. To adapt to exceptions in real data, the measures are used to relax the strictness of CFDs for mor...
Dynamic Discovery of Fuzzy Functional Dependencies Using Partitions
A functional dependency describes the relationship between attributes in a database relation. It states that the value of an attribute is uniquely determined by the values of some other attributes. It serves as a constraint between the attributes and is being used in the normalization process of relational database design. Therefore the discovery of functional dependencies from databases has be...
Publication date: 1998